Event Detection


Joint Activity Design Heuristics for Enhancing Human-Machine Collaboration

Jalaeian, Mohammadreza, Morey, Dane A., Rayo, Michael F.

arXiv.org Artificial Intelligence

Joint activity describes when more than one agent (human or machine) contributes to the completion of a task or activity. Designing for joint activity focuses on explicitly supporting the interdependencies between agents necessary for effective coordination among agents engaged in the joint activity. This builds and expands upon designing for usability to further address how technologies can be designed to act as effective team players. Effective joint activity requires supporting, at minimum, five primary macrocognitive functions within teams: Event Detection, Sensemaking, Adaptability, Perspective-Shifting, and Coordination. Supporting these functions is equally as important as making technologies usable. We synthesized fourteen heuristics from relevant literature, including display design, human factors, cognitive systems engineering, cognitive psychology, and computer science, to aid the design, development, and evaluation of technologies that support joint human-machine activity. Recent advances in Artificial Intelligence (AI) and Machine Learning (ML) technologies have accelerated the progression of human-machine interactions from simple tool-based engagements to complex cognitive collaborations [1]. Machines are being designed to perform an increasing set of functions and are expected to engage more deeply in the collaborative joint activities related to these functions. This shift in machine capabilities and expectations demands a corresponding re-evaluation and broadening of design and evaluation principles to support joint human-machine activity in ways that lie outside the boundaries of traditional usability methods and models [2]. Traditional usability heuristics, such as those proposed by [3], provide a strong foundation focusing primarily on surface-level interactions such as enhancing the ease of use, efficiency, and satisfaction in human-machine interaction.
These heuristics are primarily oriented towards actions and responses but offer limited support for the essential macrocognitive functions associated with effective teamwork, including event detection, sensemaking, adaptability, perspective shifting, and coordination [2], [4], [5], [6]. All of these macrocognitive functions are vital to the close collaboration of humans and machines in joint activities within high-stakes, dynamic environments with little room for error [2], [5]. This reliance on macrocognitive functions is evident in domains where the ability to process complex information and adapt to changing conditions is crucial.


Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

Sakajo, Haruki, Takato, Hiroshi, Tsutsui, Hiroshi, Soda, Komei, Kamigaito, Hidetaka, Watanabe, Taro

arXiv.org Artificial Intelligence

Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as the use of mobile phones while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.


DreamCatcher: A Wearer-aware Sleep Event Dataset Based on Earables in Non-restrictive Environments

Neural Information Processing Systems

Widely available earbuds equipped with sensors (also known as earables) can be combined with a sleep event detection algorithm to offer a convenient alternative to laborious clinical tests for individuals suffering from sleep disorders. Although various solutions utilizing such devices have been proposed to detect sleep events, they ignore the fact that individuals often share sleeping spaces with roommates or couples. To address this issue, we introduce DreamCatcher, the first publicly available dataset for wearer-aware sleep event algorithm development on earables.


Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Gibier, Marcel, Celton, Nolwenn, Duroselle, Raphaël, Serrano, Pierre, Boeffard, Olivier, Bonastre, Jean-François

arXiv.org Artificial Intelligence

In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
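The pipeline above ends by folding event predictions, the question, and candidate answers into a structured prompt. The following is a minimal sketch of that final step; the aggregation rule (max-pooling over segments), the threshold, the event names, and the prompt template are all illustrative assumptions, not the authors' exact implementation.

```python
def segments_to_events(segment_probs, threshold=0.5):
    """Aggregate per-segment event probabilities into clip-level predictions.
    Max-pooling over segments is one plausible aggregation choice."""
    events = []
    for label, probs in segment_probs.items():
        p = max(probs)
        if p >= threshold:
            events.append((label, p))
    # Highest-confidence events first.
    return sorted(events, key=lambda e: -e[1])

def build_prompt(events, question, candidates):
    """Fold detected events, the question, and candidate answers into a
    structured prompt for an instruction-tuned LLM (hypothetical template)."""
    event_lines = "\n".join(f"- {label} (confidence {p:.2f})" for label, p in events)
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(candidates))
    return (
        "Detected acoustic events:\n"
        f"{event_lines}\n\n"
        f"Question: {question}\n"
        f"Options:\n{options}\n"
        "Answer with the letter of the best option."
    )
```

A calibrated system would replace the raw probabilities with calibrated ones before thresholding; the prompt structure itself is what lets the LLM reason over discrete event evidence rather than raw audio features.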


More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks

Bhosale, Swapnil, Frateanu, Cosmin, Clark, Camilla, Jasonas, Arnoldas, Mitchell, Chris, Zhu, Xiatian, Ithapu, Vamsi Krishna, Ferroni, Giacomo, Bilen, Cagdas, Parekh, Sanjeel

arXiv.org Artificial Intelligence

Deploying accurate event detection on resource-constrained devices is challenged by the trade-off between performance and computational cost. While Early-Exit (EE) networks offer a solution through adaptive computation, they often fail to enforce a coherent hierarchical structure, limiting the reliability of their early predictions. To address this, we propose Hyperbolic Early-Exit networks (HypEE), a novel framework that learns EE representations in the hyperbolic space. Our core contribution is a hierarchical training objective with a novel entailment loss, which enforces a partial-ordering constraint to ensure that deeper network layers geometrically refine the representations of shallower ones. Experiments on multiple audio event detection tasks and backbone architectures show that HypEE significantly outperforms standard Euclidean EE baselines, especially at the earliest, most computationally-critical exits. The learned geometry also provides a principled measure of uncertainty, enabling a novel triggering mechanism that makes the overall system both more efficient and more accurate than a conventional EE and standard backbone models without early-exits.
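The core idea above is that representations live in hyperbolic space and deeper exits must geometrically refine shallower ones. The sketch below shows the standard Poincaré-ball distance plus a deliberately simplified stand-in for the ordering constraint (deeper embeddings pushed farther from the origin than shallower ones); the paper's actual entailment loss is more involved, and the margin value is an arbitrary assumption.

```python
import math

def poincare_distance(u, v, eps=1e-7):
    """Geodesic distance between two points on the Poincare ball."""
    nu = sum(x * x for x in u)          # squared norm of u
    nv = sum(x * x for x in v)          # squared norm of v
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * duv / max((1.0 - nu) * (1.0 - nv), eps)
    return math.acosh(arg)

def ordering_penalty(z_shallow, z_deep, margin=0.1):
    """Crude proxy for a partial-ordering constraint: penalize when a deeper
    exit's embedding is not farther from the origin (i.e. not more specific)
    than the shallower exit's embedding."""
    r_s = math.sqrt(sum(x * x for x in z_shallow))
    r_d = math.sqrt(sum(x * x for x in z_deep))
    return max(0.0, r_s - r_d + margin)
```

In hyperbolic geometry, distance from the origin grows exponentially in representational "specificity", which is why norm ordering is a natural (if crude) way to encode shallow-to-deep refinement.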


Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

Cai, Pengfei, Song, Yan, Gu, Qing, Jiang, Nan, Song, Haoyu, McLoughlin, Ian

arXiv.org Artificial Intelligence

Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at https://cai525.github.io/Transformer4SED/demo_page/DASM/.
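The "frame-level retrieval" formulation above can be illustrated with a toy matcher: each frame embedding is scored against each query vector, and frames whose similarity clears a threshold are marked active for that query's event. Cosine similarity and the threshold value are illustrative assumptions; DASM's actual matching uses learned fusion, not a raw similarity rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids /0)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-9)

def frame_level_retrieval(frames, queries, threshold=0.5):
    """Match each frame embedding against each query vector.
    frames:  list of frame embeddings (one per time step)
    queries: dict mapping event label -> query vector (from text or audio)
    Returns, per label, the indices of frames where the event is active."""
    detections = {}
    for label, q in queries.items():
        active = [t for t, f in enumerate(frames) if cosine(f, q) >= threshold]
        detections[label] = active
    return detections
```

Because queries can come from arbitrary text or audio prompts, nothing in this formulation restricts the label set, which is what makes the approach open-vocabulary.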


Not in Sync: Unveiling Temporal Bias in Audio Chat Models

Yao, Jiayu, Liu, Shenghua, Wang, Yiwei, Cheng, Rundong, Mei, Lingrui, Bi, Baolong, Xiong, Zhen, Cheng, Xueqi

arXiv.org Artificial Intelligence

Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length - even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
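A natural way to quantify the systematic early/late misalignment described above is the mean signed timestamp error. The function below is one plausible formalization of such an index; the paper's exact TBI definition may differ.

```python
def temporal_bias_index(predicted, ground_truth):
    """Mean signed timestamp error in seconds.
    Negative values: predictions systematically earlier than ground truth.
    Positive values: predictions systematically later.
    (Illustrative formalization, not necessarily the paper's exact TBI.)"""
    assert len(predicted) == len(ground_truth) and predicted
    errors = [p - g for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)
```

Unlike mean absolute error, a signed mean preserves the direction of the bias, which is exactly the property needed to detect the consistent early or late drift the study reports.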


Real-Time Health Analytics Using Ontology-Driven Complex Event Processing and LLM Reasoning: A Tuberculosis Case Study

Chandra, Ritesh, Agarwal, Sonali, Singh, Navjot

arXiv.org Artificial Intelligence

Timely detection of critical health conditions remains a major challenge in public health analytics, especially in Big Data environments characterized by high volume, rapid velocity, and diverse variety of clinical data. This study presents an ontology-enabled real-time analytics framework that integrates Complex Event Processing (CEP) and Large Language Models (LLMs) to enable intelligent health event detection and semantic reasoning over heterogeneous, high-velocity health data streams. The architecture leverages the Basic Formal Ontology (BFO) and Semantic Web Rule Language (SWRL) to model diagnostic rules and domain knowledge. Patient data is ingested and processed using Apache Kafka and Spark Streaming, where CEP engines detect clinically significant event patterns. LLMs support adaptive reasoning, event interpretation, and ontology refinement. Clinical information is semantically structured as Resource Description Framework (RDF) triples in Graph DB, enabling SPARQL-based querying and knowledge-driven decision support. The framework is evaluated using a dataset of 1,000 Tuberculosis (TB) patients as a use case, demonstrating low-latency event detection, scalable reasoning, and high model performance (in terms of precision, recall, and F1-score). These results validate the system's potential for generalizable, real-time health analytics in complex Big Data scenarios.
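The CEP step described above detects clinically significant patterns over a high-velocity event stream. The toy matcher below sketches that idea with a sliding window over per-patient events; the rule, event names, and window size are illustrative assumptions and bear no relation to the study's actual SWRL rules or Kafka/Spark deployment.

```python
from collections import deque

def detect_patterns(stream, required=frozenset({"fever", "persistent_cough"}),
                    window=5):
    """Toy CEP-style matcher: raise an alert when all required clinical
    events for one patient co-occur within a sliding window of the stream.
    stream: iterable of (patient_id, event_name) pairs in arrival order."""
    recent = {}   # patient_id -> deque of (stream_index, event_name)
    alerts = []
    for i, (patient, event) in enumerate(stream):
        q = recent.setdefault(patient, deque())
        q.append((i, event))
        # Expire events that fell out of the window.
        while q and i - q[0][0] >= window:
            q.popleft()
        # Fire when every required event is present in the window.
        if required <= {e for _, e in q}:
            alerts.append((patient, i))
            q.clear()   # avoid duplicate alerts for the same evidence
    return alerts
```

A production CEP engine would evaluate declarative rules (here, SWRL over a BFO-based ontology) rather than hard-coded sets, but the windowed co-occurrence logic is the same underlying mechanism.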



OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

Biswas, Subrata, Khan, Mohammad Nur Hossain, Islam, Bashima

arXiv.org Artificial Intelligence

Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial-Acoustic Geometry Encoder (SAGE), a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room-impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain-of-thought to rationalize over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, OWL supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11° through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT.
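The mean DoA error metric reported above is circular: a prediction of 350° against a ground truth of 10° is 20° off, not 340°. The helper below shows the standard angular wrapping; it is a generic illustration of the metric, not the authors' evaluation code.

```python
def mean_doa_error(pred_deg, true_deg):
    """Mean absolute direction-of-arrival error in degrees.
    Differences are wrapped onto [0, 180] so that angles near the 0/360
    boundary are compared along the shorter arc."""
    errs = []
    for p, t in zip(pred_deg, true_deg):
        d = abs(p - t) % 360.0
        errs.append(min(d, 360.0 - d))
    return sum(errs) / len(errs)
```

Without the wrap, a model that predicts 359° for a source at 1° would be scored as catastrophically wrong, masking genuinely good azimuth estimates.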